Citi Bike, New York City’s bike share system, has released public data since July 2013. As a cyclist, I found it interesting to investigate the publicly available data. I used the data for July 2016 for my investigation. The data contains 1,380,110 rows (one per trip) and 15 variables, including trip duration, start time, stop time, station latitude, and station longitude. I used R, a statistical computing language, for this analysis. Among the packages for R, ggplot2 and ggmap were very useful for visualization.
My investigation focuses mainly on:
I initially approached the data from the bike station standpoint. There are 483 bike stations in NYC, and I assume some are popular and others less so, probably because of their locations. I have used rental bike services several times when visiting cities in the U.S. and abroad. Personally, I looked for a bike station close to a train station or the hotel where I stayed.
Look at the chart, which shows the count of bike rentals per station for each day of the week. There are many outliers on every day; they are shown above the whiskers of the boxes. Take a closer look. On Monday morning, the mean number of rentals per station is 20. However, the station with the largest number of rentals had 278 rentals, which is more than 10 times the mean. In addition, we can see some outliers with values from 100 to 200. It seems worth exploring the outliers further.
Sidenote1: Since this dataset contains all transactions in July 2016, each day of the week covers four or five calendar dates. For example, Sunday contains the transactions from the Sundays of July 3rd, 10th, 17th, 24th, and 31st.
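The four-or-five count in the sidenote is easy to verify. The analysis itself was done in R; the sketch below uses Python's standard library purely for illustration, and the function name `weekday_counts` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_counts(year, month):
    """Count how many times each weekday occurs in the given month."""
    counts = Counter()
    day = date(year, month, 1)
    while day.month == month:
        counts[DAY_NAMES[day.weekday()]] += 1
        day += timedelta(days=1)
    return counts


july_2016 = weekday_counts(2016, 7)
for name in DAY_NAMES:
    print(name, july_2016[name])
```

In July 2016, Friday, Saturday, and Sunday each occur five times and the other days four times, which is worth keeping in mind when comparing monthly totals across days of the week.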
Another interesting observation is that Saturday and Sunday show a similar pattern, and so do Tuesday, Wednesday, and Thursday. Rentals are slow in the morning on Saturday and Sunday, with a narrower IQR and relatively fewer outliers. Rentals go up toward the afternoon and night. For Tuesday, Wednesday, and Thursday, rentals are higher in the morning and at night, and lower in the afternoon, with fewer outliers and a lower maximum.
Sidenote2: The IQR, or interquartile range, is a measure of variability. The quartiles Q1, Q2 (the median), and Q3 are the three cut points that divide a rank-ordered data set into four equal parts.
The interquartile range is equal to Q3 minus Q1.
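As a small worked example of the IQR and of how a box plot flags outliers, here is a Python sketch (the analysis itself used R). The rental counts are made-up numbers, not values from the Citi Bike data, and the 1.5 × IQR fence is the common box plot convention, assumed here rather than confirmed as the exact rule behind the chart.

```python
from statistics import quantiles

# Hypothetical rentals-per-station counts, NOT taken from the Citi Bike data.
rentals = [5, 8, 12, 15, 18, 20, 22, 25, 30, 278]

# Three cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(rentals, n=4, method="inclusive")
iqr = q3 - q1

# Common box plot convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in rentals if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```

With these numbers, Q1 is 12.75, Q3 is 24.25, the IQR is 11.5, and only the station with 278 rentals falls outside the fences.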
Think about the outliers highlighted in the previous chart. Those outliers are stations renting out a lot of bikes to customers. Customers rent a bike and then return it, which means there are also stations receiving bikes from customers. I want to investigate outlier stations not only for departing (renting) bikes but also for arriving (receiving) bikes.
I calculated the number of departing bikes and arriving bikes for each of the 483 stations and picked the 100 stations with the highest number of transactions for the month. Then I plotted them on the map below, using the longitude and latitude of each bike station. The map shows the location of each station and its volume of transactions, but it is not very insightful.
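The counting step described above can be sketched as follows. This is a minimal Python illustration rather than the R code used for the analysis. The column names `start station id` and `end station id` are assumed to match the trip file headers, the sample rows are invented, and `station_activity` and `top_stations` are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Tiny invented sample; the column names are an assumption about the trip file.
SAMPLE = """start station id,end station id
519,497
519,402
497,519
402,519
3255,519
"""


def station_activity(rows):
    """Count departures and arrivals per station."""
    departures, arrivals = Counter(), Counter()
    for row in rows:
        departures[row["start station id"]] += 1
        arrivals[row["end station id"]] += 1
    return departures, arrivals


def top_stations(departures, arrivals, n):
    """Return the n stations with the most transactions (rentals + returns)."""
    return (departures + arrivals).most_common(n)


departures, arrivals = station_activity(csv.DictReader(io.StringIO(SAMPLE)))
print(top_stations(departures, arrivals, 1))
```

For the real data, the same functions would be fed a reader over the monthly trip file, with `n` set to 100.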
Step back and think about how to analyze popular stations further. Looking at the first chart again, the number of bike rentals changes during the day, so it seems worth checking whether the popular stations also change during the day. For example, there might be morning stations where people rent bikes in the morning but few people come after noon. Another angle is to classify stations into two categories: arriving stations and departing stations. This might uncover interesting facts.
In the three maps above, blue dots represent the top 50 departing stations and red dots represent the top 50 arriving stations for each part of the day. Again, the size of a dot reflects the number of bikes rented and returned. The new visualizations provide more insight than the previous one. First, arriving stations and departing stations form separate clusters. In the morning, we can observe three (or four) clusters of arriving stations and two clusters of departing stations. In the afternoon, there are two big clusters to the north. At night, clusters appear in locations similar to those in the morning, but the locations of arriving stations and departing stations are switched.
Sidenote3: I calculated “surplus” and “deficit” to identify the departing stations and arriving stations. “Surplus” is calculated for each station as the number of rentals minus the number of returns. “Deficit” is the opposite of “surplus”: the number of returns minus the number of rentals. If a station has a “surplus”, it is renting out more bikes than it receives. I define this type of station as a departing station and a station with a “deficit” as an arriving station.
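The surplus and deficit calculation can be sketched per part of the day as below. This is an illustrative Python version, not the original R code. The hour boundaries for morning, afternoon, and night are assumptions (the post does not state them), the trips are invented, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def period_of_day(hour):
    """Map an hour to a part of the day. The boundaries are an assumption."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    return "night"


def surplus_by_period(trips):
    """Surplus = rentals minus returns, per (period, station)."""
    surplus = defaultdict(int)
    for start_station, start_time, end_station, stop_time in trips:
        surplus[(period_of_day(start_time.hour), start_station)] += 1
        surplus[(period_of_day(stop_time.hour), end_station)] -= 1
    return surplus


def top_by_surplus(surplus, period, n):
    rows = [(s, v) for (p, s), v in surplus.items() if p == period]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


def top_by_deficit(surplus, period, n):
    # Deficit is the negated surplus: returns minus rentals.
    rows = [(s, -v) for (p, s), v in surplus.items() if p == period]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


# Invented trips: (start station, start time, end station, stop time).
trips = [
    ("A", datetime(2016, 7, 4, 8, 5), "B", datetime(2016, 7, 4, 8, 20)),
    ("A", datetime(2016, 7, 4, 8, 30), "B", datetime(2016, 7, 4, 8, 50)),
    ("C", datetime(2016, 7, 4, 9, 0), "B", datetime(2016, 7, 4, 9, 15)),
    ("B", datetime(2016, 7, 4, 18, 0), "A", datetime(2016, 7, 4, 18, 20)),
    ("B", datetime(2016, 7, 4, 18, 10), "A", datetime(2016, 7, 4, 18, 30)),
]
flow = surplus_by_period(trips)
print(top_by_surplus(flow, "morning", 1), top_by_deficit(flow, "morning", 1))
```

In this toy example, station A has the largest morning surplus and station B the largest morning deficit, and the roles flip at night, which mirrors the switch seen on the maps.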
One way to interpret this pattern in the morning and at night is that customers rent a bike around the East Village and Pennsylvania Station (blue bubble clusters) and ride to Midtown, Lower Manhattan, and so on (red bubble clusters). I assume this is mostly for commuting because the destination areas under the red bubble clusters have a lot of offices. At night, customers head back from their offices (blue bubble clusters) to their homes, transit stations, or elsewhere (red bubble clusters).
To validate this idea, let’s look at the types of customers at popular stations. As shown below, I created charts for both popular arriving stations and popular departing stations with a customer breakdown: subscriber (blue in the chart below) and customer (red). Most of the popular stations are dominated by subscribers, especially in the morning and at night at arriving stations. This supports my idea that riders use the bikes for their commute, because subscribers are expected to ride regularly and to pick up and drop off bikes at the same stations.
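The customer breakdown behind such charts boils down to a share of subscribers per station. A minimal Python sketch (the analysis itself used R) follows. The user type labels `Subscriber` and `Customer` are assumed to match the values in the trip data, the sample trips are invented, and `subscriber_share` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def subscriber_share(trips):
    """Return the fraction of trips made by subscribers for each station.

    `trips` is an iterable of (station_id, usertype) pairs.
    """
    counts = defaultdict(Counter)
    for station, usertype in trips:
        counts[station][usertype] += 1
    return {
        station: c["Subscriber"] / sum(c.values())
        for station, c in counts.items()
    }


# Invented sample, NOT taken from the Citi Bike data.
trips = [
    ("A", "Subscriber"), ("A", "Subscriber"), ("A", "Subscriber"),
    ("A", "Customer"),
    ("B", "Customer"), ("B", "Customer"),
]
shares = subscriber_share(trips)
print(shares)
```

A station with a share close to 1 is dominated by subscribers, which is the pattern the charts show for the popular commuting stations.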
Let’s pick some stations to see where they are located and think about what’s going on at the stations.
We found some interesting patterns in bike rentals through visualization and analysis. The logical next step is to extract value from the findings. Before diving into ideas for turning data into value, let’s understand who is involved in this bike rental service: the service operator, subscribers, customers (one-time users), businesses along bike routes and around stations, and so on. Then, think about their potential needs and pain points. For now, I focused on the service operator, subscribers, and customers, and listed some questions to think about.
The service operator:
- How to handle fluctuating demand throughout the day at 483 bike stations?
- More specifically, how to manage rental requests at popular departing stations?
- Should they provide the same service to subscribers and customers?
- How actively are subscribers using the service? Who are the dormant customers, and why?
- There are so many bikes (8,143, in fact). Which bikes should they check and repair?

Subscribers:
- How to commute safely by bike?
- Commuting is not exciting. Is there anything that would make the commute fun?
- What is the best time to commute? How is the traffic at that time?

Customers:
- I don’t know where to go and what to see as a tourist. Where should I go by bike?
- Where is the nearest bike station?
- Why bike? What is the advantage of renting a bike over taking an Uber or Lyft?
Some of these are important business questions, and others less so. Let’s focus on the important questions that the dataset might help to answer. Below, I summarize ideas about how to address the selected questions using the dataset.